Córdoba Province
A Consensus Privacy Metrics Framework for Synthetic Data
Pilgram, Lisa, Dankar, Fida K., Drechsler, Jorg, Elliot, Mark, Domingo-Ferrer, Josep, Francis, Paul, Kantarcioglu, Murat, Kong, Linglong, Malin, Bradley, Muralidhar, Krishnamurty, Myles, Puja, Prasser, Fabian, Raisaro, Jean Louis, Yan, Chao, Emam, Khaled El
Synthetic data generation is one approach for sharing individual-level data. However, to meet legislative requirements, it is necessary to demonstrate that the individuals' privacy is adequately protected. There is no consolidated standard for measuring privacy in synthetic data. Through an expert panel and consensus process, we developed a framework for evaluating privacy in synthetic data. Our findings indicate that current similarity metrics fail to measure identity disclosure, and their use is discouraged. For differentially private synthetic data, a privacy budget other than close to zero was not considered interpretable. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information about an individual without necessarily revealing their identity. The resultant framework provides precise recommendations for metrics that address these types of disclosures effectively. Our findings further present specific opportunities for future research that can help with widespread adoption of synthetic data.
Splitting criteria for ordinal decision trees: an experimental study
Ayllón-Gavilán, Rafael, Martínez-Estudillo, Francisco José, Guijo-Rubio, David, Hervás-Martínez, César, Gutiérrez, Pedro Antonio
Ordinal Classification (OC) is a machine learning field that addresses classification tasks where the labels exhibit a natural order. Unlike nominal classification, which treats all classes as equally distinct, OC takes the ordinal relationship into account, producing more accurate and relevant results. This is particularly critical in applications where the magnitude of classification errors has implications. Despite this, OC problems are often tackled using nominal methods, leading to suboptimal solutions. Although decision trees are one of the most popular classification approaches, ordinal tree-based approaches have received less attention when compared to other classifiers. This work conducts an experimental study of tree-based methodologies specifically designed to capture ordinal relationships. A comprehensive survey of ordinal splitting criteria is provided, standardising the notations used in the literature for clarity. Three ordinal splitting criteria, Ordinal Gini (OGini), Weighted Information Gain (WIG), and Ranking Impurity (RI), are compared to the nominal counterparts of the first two (Gini and information gain), by incorporating them into a decision tree classifier. An extensive repository considering 45 publicly available OC datasets is presented, supporting the first experimental comparison of ordinal and nominal splitting criteria using well-known OC evaluation metrics. Statistical analysis of the results highlights OGini as the most effective ordinal splitting criterion to date. Source code, datasets, and results are made available to the research community.
Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting
Chłopowiec, Adrian B., Chłopowiec, Adam R., Galus, Krzysztof, Cebula, Wojciech, Tabakov, Martin
Limited medical imaging datasets challenge deep learning models by increasing risks of overfitting and reduced generalization, particularly in Generative Adversarial Networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative Data Augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson Image Editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method, leveraging a fine-tuned Image Inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two proposed methods demonstrates that effective local lesion generation in a data-constrained setting allows for reaching new state-of-the-art results in capsule endoscopy lesion classification. Combination of our techniques achieves a macro F1-score of 33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy. To the best of our knowledge, this work is the first to apply a fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality examples, facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further improves the results.
Identifying Key Drivers of Heatwaves: A Novel Spatio-Temporal Framework for Extreme Event Detection
Pérez-Aracil, J., Peláez-Rodríguez, C., McAdam, Ronan, Squintu, Antonello, Marina, Cosmin M., Lorente-Ramos, Eugenio, Luther, Niklas, Torralba, Veronica, Scoccimarro, Enrico, Cavicchia, Leone, Giuliani, Matteo, Zorita, Eduardo, Hansen, Felicitas, Barriopedro, David, Garcia-Herrera, Ricardo, Gutiérrez, Pedro A., Luterbacher, Jürg, Xoplaki, Elena, Castelletti, Andrea, Salcedo-Sanz, S.
Heatwaves (HWs) are extreme atmospheric events that produce significant societal and environmental impacts. Predicting these extreme events remains challenging, as their complex interactions with large-scale atmospheric and climatic variables are difficult to capture with traditional statistical and dynamical models. This work presents a general method for driver identification in extreme climate events. A novel framework (STCO-FS) is proposed to identify key immediate (short-term) HW drivers by combining clustering algorithms with an ensemble evolutionary algorithm. The framework analyzes spatio-temporal data, reduces dimensionality by grouping similar geographical nodes for each variable, and develops driver selection in spatial and temporal domains, identifying the best time lags between predictive variables and HW occurrences. The proposed method has been applied to analyze HWs in the Adda river basin in Italy. The approach effectively identifies significant variables influencing HWs in this region. This research can potentially enhance our understanding of HW drivers and predictability.
STIED: A deep learning model for the SpatioTemporal detection of focal Interictal Epileptiform Discharges with MEG
Fernández-Martín, Raquel, Gijón, Alfonso, Feys, Odile, Juvené, Elodie, Aeby, Alec, Urbain, Charline, De Tiège, Xavier, Wens, Vincent
Magnetoencephalography (MEG) allows the non-invasive detection of interictal epileptiform discharges (IEDs). Clinical MEG analysis in epileptic patients traditionally relies on the visual identification of IEDs, which is time consuming and partially subjective. Automatic, data-driven detection methods exist but show limited performance. Still, the rise of deep learning (DL)-with its ability to reproduce human-like abilities-could revolutionize clinical MEG practice. Here, we developed and validated STIED, a simple yet powerful supervised DL algorithm combining two convolutional neural networks with temporal (1D time-course) and spatial (2D topography) features of MEG signals inspired from current clinical guidelines. Our DL model enabled both temporal and spatial localization of IEDs in patients suffering from focal epilepsy with frequent and high amplitude spikes (FE group), with high-performance metrics-accuracy, specificity, and sensitivity all exceeding 85%-when learning from spatiotemporal features of IEDs. This performance can be attributed to our handling of input data, which mimics established clinical MEG practice. Reverse engineering further revealed that STIED encodes fine spatiotemporal features of IEDs rather than their mere amplitude. The model trained on the FE group also showed promising results when applied to a separate group of presurgical patients with different types of refractory focal epilepsy, though further work is needed to distinguish IEDs from physiological transients. This study paves the way of incorporating STIED and DL algorithms into the routine clinical MEG evaluation of epilepsy.
Improving the portability of predicting students performance models by using ontologies
Zambrano, Javier Lopez, Lara, Juan A., Romero, Cristobal
One of the main current challenges in Educational Data Mining and Learning Analytics is the portability or transferability of predictive models obtained for a particular course so that they can be applied to other different courses. To handle this challenge, one of the foremost problems is the models excessive dependence on the low-level attributes used to train them, which reduces the models portability. To solve this issue, the use of high level attributes with more semantic meaning, such as ontologies, may be very useful. Along this line, we propose the utilization of an ontology that uses a taxonomy of actions that summarises students interactions with the Moodle learning management system. We compare the results of this proposed approach against our previous results when we used low-level raw attributes obtained directly from Moodle logs. The results indicate that the use of the proposed ontology improves the portability of the models in terms of predictive accuracy. The main contribution of this paper is to show that the ontological models obtained in one source course can be applied to other different target courses with similar usage levels without losing prediction accuracy.
Improving the classification of extreme classes by means of loss regularisation and generalised beta distributions
Vargas, Víctor Manuel, Gutiérrez, Pedro Antonio, Barbero-Gómez, Javier, Hervás-Martínez, César
An ordinal classification problem is one in which the target variable takes values on an ordinal scale. Nowadays, there are many of these problems associated with real-world tasks where it is crucial to accurately classify the extreme classes of the ordinal structure. In this work, we propose a unimodal regularisation approach that can be applied to any loss function to improve the classification performance of the first and last classes while maintaining good performance for the remainder. The proposed methodology is tested on six datasets with different numbers of classes, and compared with other unimodal regularisation methods in the literature. In addition, performance in the extreme classes is compared using a new metric that takes into account their sensitivities. Experimental results and statistical analysis show that the proposed methodology obtains a superior average performance considering different metrics. The results for the proposed metric show that the generalised beta distribution generally improves classification performance in the extreme classes. At the same time, the other five nominal and ordinal metrics considered show that the overall performance is aligned with the performance of previous alternatives.
A Two-stage Reinforcement Learning-based Approach for Multi-entity Task Allocation
Gong, Aicheng, Yang, Kai, Lyu, Jiafei, Li, Xiu
Task allocation is a key combinatorial optimization problem, crucial for modern applications such as multi-robot cooperation and resource scheduling. Decision makers must allocate entities to tasks reasonably across different scenarios. However, traditional methods assume static attributes and numbers of tasks and entities, often relying on dynamic programming and heuristic algorithms for solutions. In reality, task allocation resembles Markov decision processes, with dynamically changing task and entity attributes. Thus, algorithms must dynamically allocate tasks based on their states. To address this issue, we propose a two-stage task allocation algorithm based on similarity, utilizing reinforcement learning to learn allocation strategies. The proposed pre-assign strategy allows entities to preselect appropriate tasks, effectively avoiding local optima and thereby better finding the optimal allocation. We also introduce an attention mechanism and a hyperparameter network structure to adapt to the changing number and attributes of entities and tasks, enabling our network structure to generalize to new tasks. Experimental results across multiple environments demonstrate that our algorithm effectively addresses the challenges of dynamic task allocation in practical applications. Compared to heuristic algorithms like genetic algorithms, our reinforcement learning approach better solves dynamic allocation problems and achieves zero-shot generalization to new tasks with good performance. The code is available at https://github.com/yk7333/TaskAllocation.
aeon: a Python toolkit for learning from time series
Middlehurst, Matthew, Ismail-Fawaz, Ali, Guillaume, Antoine, Holder, Christopher, Rubio, David Guijo, Bulatova, Guzal, Tsaprounis, Leonidas, Mentel, Lukasz, Walter, Martin, Schäfer, Patrick, Bagnall, Anthony
aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at https://github.com/ aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint aeon has released v0.9.0, and has had substantial changes.
Towards Explainable Test Case Prioritisation with Learning-to-Rank Models
Ramírez, Aurora, Berrios, Mario, Romero, José Raúl, Feldt, Robert
Test case prioritisation (TCP) is a critical task in regression testing to ensure quality as software evolves. Machine learning has become a common way to achieve it. In particular, learning-to-rank (LTR) algorithms provide an effective method of ordering and prioritising test cases. However, their use poses a challenge in terms of explainability, both globally at the model level and locally for particular results. Here, we present and discuss scenarios that require different explanations and how the particularities of TCP (multiple builds over time, test case and test suite variations, etc.) could influence them. We include a preliminary experiment to analyse the similarity of explanations, showing that they do not only vary depending on test case-specific predictions, but also on the relative ranks.